 microbiome data


DepMicroDiff: Diffusion-Based Dependency-Aware Multimodal Imputation for Microbiome Data

Sadia, Rabeya Tus, Cheng, Qiang

arXiv.org Artificial Intelligence

Microbiome data analysis is essential for understanding host health and disease, yet its inherent sparsity and noise pose major challenges for accurate imputation, hindering downstream tasks such as biomarker discovery. Existing imputation methods, including recent diffusion-based models, often fail to capture the complex interdependencies between microbial taxa and overlook contextual metadata that can inform imputation. We introduce DepMicroDiff, a novel framework that combines diffusion-based generative modeling with a Dependency-Aware Transformer (DAT) to explicitly capture both mutual pairwise dependencies and autoregressive relationships. DepMicroDiff is further enhanced by VAE-based pretraining across diverse cancer datasets and conditioning on patient metadata encoded via a large language model (LLM). Experiments on TCGA microbiome datasets show that DepMicroDiff substantially outperforms state-of-the-art baselines, achieving higher Pearson correlation (up to 0.712), cosine similarity (up to 0.812), and lower RMSE and MAE across multiple cancer types, demonstrating its robustness and generalizability for microbiome imputation. Microbiome data analysis plays a critical role in understanding host health, disease progression, and therapeutic response, particularly in contexts such as cancer progression, gut-brain interactions, and immunotherapy [1]. However, microbiome datasets, derived from 16S rRNA or metagenomic sequencing, are notoriously sparse and noisy due to limitations in sequencing technologies, biological variability, and compositional constraints.
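
The four evaluation metrics named in the abstract (Pearson correlation, cosine similarity, RMSE, MAE) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names and the toy `truth` and `imputed` vectors are hypothetical, and the paper may compute the metrics per sample, per taxon, or only over masked entries.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length abundance vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def cosine_similarity(x, y):
    """Cosine of the angle between the two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def rmse(x, y):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

def mae(x, y):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

# Hypothetical toy data: true abundances vs. imputed abundances for four taxa.
truth = [0.0, 2.0, 4.0, 6.0]
imputed = [1.0, 2.0, 3.0, 6.0]

scores = {
    "pearson": pearson(truth, imputed),
    "cosine": cosine_similarity(truth, imputed),
    "rmse": rmse(truth, imputed),
    "mae": mae(truth, imputed),
}
```

Higher Pearson and cosine values and lower RMSE and MAE indicate imputed values closer to the held-out truth, which is the direction of improvement the abstract reports.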


ADAM-1: AI and Bioinformatics for Alzheimer's Detection and Microbiome-Clinical Data Integrations

Huang, Ziyuan, Sekhon, Vishaldeep Kaur, Guo, Ouyang, Newman, Mark, Sadeghian, Roozbeh, Vaida, Maria L., Jo, Cynthia, Ward, Doyle, Bucci, Vanni, Haran, John P.

arXiv.org Artificial Intelligence

The Alzheimer's Disease Analysis Model Generation 1 (ADAM-1) is a multi-agent large language model (LLM) framework designed to integrate and analyze multi-modal data, including microbiome profiles, clinical datasets, and external knowledge bases, to enhance the understanding and detection of Alzheimer's disease (AD). By leveraging retrieval-augmented generation (RAG) techniques along with its multi-agent architecture, ADAM-1 synthesizes insights from diverse data sources and contextualizes findings using literature-driven evidence. Comparative evaluation against XGBoost revealed similar mean F1 scores but significantly reduced variance for ADAM-1, highlighting its robustness and consistency, particularly in small laboratory datasets. While currently tailored for binary classification tasks, future iterations aim to incorporate additional data modalities, such as neuroimaging and biomarkers, to broaden the scalability and applicability for Alzheimer's research and diagnostics.


Human Limits in Machine Learning: Prediction of Plant Phenotypes Using Soil Microbiome Data

Aghdam, Rosa, Tang, Xudong, Shan, Shan, Lankau, Richard, Solís-Lemus, Claudia

arXiv.org Artificial Intelligence

The preservation of soil health has been identified as one of the main challenges of the XXI century given its vast (and potentially threatening) ramifications in agriculture, human health and biodiversity. Here, we provide the first deep investigation of the predictive potential of machine-learning models to understand the connections between soil and biological phenotypes. Indeed, we investigate an integrative framework performing accurate machine-learning-based prediction of plant phenotypes from biological, chemical and physical properties of the soil via two models: random forest and Bayesian neural network. We show that prediction is improved, as evidenced by higher weighted F1 scores, when incorporating into the models environmental features like soil physicochemical properties and microbial population density in addition to the microbiome information. Furthermore, by exploring multiple data preprocessing strategies such as normalization, zero replacement, and data augmentation, we confirm that human decisions have a huge impact on the predictive performance. In particular, we show that the naive total sum scaling normalization that is commonly used in microbiome research is not the optimal strategy to maximize predictive power. In addition, we find that accurately defined labels are more important than normalization, taxonomic level or model characteristics. That is, if humans are unable to classify the samples and provide accurate labels, the performance of machine-learning models will be limited. Lastly, we present strategies for domain scientists via a full model selection decision tree to identify the human choices that maximize the prediction power of the models. Our work is accompanied by open source reproducible scripts (https://github.com/solislemuslab/soil-microbiome-nn) for maximum outreach among the microbiome research community.
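
For readers unfamiliar with the preprocessing choices mentioned above, the following is a minimal sketch of total sum scaling (TSS) combined with one simple form of zero replacement. It is not taken from the linked repository: the function name, the additive pseudocount strategy, and the toy counts are assumptions made for illustration, and the paper evaluates several alternative strategies.

```python
def total_sum_scaling(counts, pseudocount=0.0):
    """Convert one sample's raw taxon counts to relative abundances.

    A positive pseudocount is added to every taxon before scaling, which is
    one simple (assumed) way to replace zeros.
    """
    adjusted = [c + pseudocount for c in counts]
    total = sum(adjusted)
    if total == 0:
        raise ValueError("sample has no reads to normalize")
    return [a / total for a in adjusted]

# Hypothetical read counts for four taxa in a single soil sample.
sample = [30, 0, 10, 60]

tss = total_sum_scaling(sample)                          # zeros stay zero
tss_pseudo = total_sum_scaling(sample, pseudocount=1.0)  # zeros become small positives
```

Both outputs sum to one, so each value is a proportion of the sample rather than an absolute abundance. The abstract's point is that this common default is not necessarily the best choice for prediction.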


Interpreting tree ensemble machine learning models with endoR

#artificialintelligence

Background: Tree ensemble machine learning models are increasingly used in microbiome science as they are compatible with the compositional, high-dimensional, and sparse structure of sequence-based microbiome data. While such models are often good at predicting phenotypes based on microbiome data, they only yield limited insights into how microbial taxa or genomic content may be associated. Results: We developed endoR, a method to interpret a fitted tree ensemble model. Both the network and importance scores derived from endoR provide insights into how features, and interactions between them, contribute to the predictive performance of the fitted model. Adjustable regularization and bootstrapping help reduce the complexity and ensure that only essential parts of the model are retained. We assessed the performance of endoR on both simulated and real metagenomic data.


AI helps explain your microbiome

#artificialintelligence

Can the tiny microbes on your leg reveal your age, or that you smoke? Or that you are menopausal? Billions of microbes live on our skin; they help maintain skin condition, serve as our first line of defense against external pathogens, and can affect how we respond to treatment. These microbes, which are one tenth the size of human cells, are part of the human microbiome, which consists of the collective genome of microbes inhabiting the human body, including bacteria, archaea, viruses, and fungi. A better understanding of our microbiome could help improve overall health and wellbeing and accelerate the development of personalized treatments (including prebiotics, probiotics, and postbiotics).


AI Can Predict your Age Based on Your Microbiome

#artificialintelligence

The human microbiome consists of a community of trillions of micro-organisms, such as bacteria, fungi, and viruses, that live all over the body, including on the skin, in the mouth, and along the digestive tract. A balanced microbiome is important for an individual's health and wellbeing, including the proper functioning of the digestive and immune systems. The human microbiome is constantly evolving and has been observed to change with age. The presence of unusually early microbiome aging patterns, relative to chronological age, could potentially signal altered susceptibility to age-related diseases. Conversely, a "young" microbiome might offer clues on how to decelerate the aging process.


New machine learning tool predicts devastating intestinal disease in premature infants

#artificialintelligence

Necrotizing enterocolitis (NEC) is a life-threatening intestinal disease of prematurity. Characterized by sudden and progressive intestinal inflammation and tissue death, it affects up to 11,000 premature infants in the United States annually, and 15-30% of affected babies die from NEC. Survivors often face long-term intestinal and neurodevelopmental complications. Researchers from Columbia Engineering and the University of Pittsburgh have developed a sensitive and specific early warning system for predicting NEC in premature infants before the disease occurs. The prototype predicts NEC accurately and early, using stool microbiome features combined with clinical and demographic information. The pilot study was presented virtually on July 23 at ACM CHIL 2020.


Advancing Microbiome Research Through Data Collaboration

#artificialintelligence

The National Microbiome Data Collaborative (NMDC), a new initiative aimed at empowering microbiome research, is gearing up its pilot phase after receiving $10 million from the U.S. Department of Energy (DOE) Office of Science. Spearheaded by Lawrence Berkeley National Laboratory (Berkeley Lab), in partnership with Los Alamos (LANL), Oak Ridge (ORNL), and Pacific Northwest (PNNL) national laboratories, the NMDC will leverage DOE's existing data-science resources and high-performance computing systems to develop a framework that facilitates more efficient use of microbiome data for applications in energy, environment, health, and agriculture. Nearly every ecosystem and organism on Earth hosts a diverse community of microorganisms: its microbiome. Yet we know little about the functions of individual microbes, let alone how they interact with each other, their hosts, or their environments, and how their activity varies over time or in response to perturbations. The past decade has seen tremendous advances in genome and metagenome DNA-sequencing technologies, which have led to an unprecedented volume of microbiome data being generated.


Global forensic geolocation with deep neural networks

Grantham, Neal S., Reich, Brian J., Laber, Eric B., Pacifici, Krishna, Dunn, Robert R., Fierer, Noah, Gebert, Matthew, Allwood, Julia S., Faith, Seth A.

arXiv.org Machine Learning

An important problem in forensic analyses is identifying the provenance of materials at a crime scene, such as biological material on a piece of clothing. This procedure, known as geolocation, is conventionally guided by expert knowledge of the biological evidence and therefore tends to be application-specific, labor-intensive, and subjective. Purely data-driven methods have yet to be fully realized due in part to the lack of a sufficiently rich data source. However, high-throughput sequencing technologies are able to identify tens of thousands of microbial taxa using DNA recovered from a single swab collected from nearly any object or surface. We present a new algorithm for geolocation that aggregates over an ensemble of deep neural network classifiers trained on randomly-generated Voronoi partitions of a spatial domain. We apply the algorithm to fungi present in each of 1300 dust samples collected across the continental United States and then to a global dataset of dust samples from 28 countries. Our algorithm makes remarkably good point predictions with more than half of the geolocation errors under 100 kilometers for the continental analysis and nearly 90% classification accuracy of a sample's country of origin for the global analysis. We suggest that the effectiveness of this model sets the stage for a new, quantitative approach to forensic geolocation.
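
The partitioning step described in the abstract can be sketched as follows. This is an illustration only, not the authors' implementation: it covers just the generation of random Voronoi partitions and the assignment of each sample to a cell, which would then serve as the class label for one classifier in the ensemble. The function names, the bounding box, the sample coordinates, and the use of planar Euclidean distance on longitude and latitude are all simplifying assumptions.

```python
import math
import random

def random_voronoi_seeds(n_cells, lon_range, lat_range, rng):
    """Draw n_cells random seed points inside a bounding box."""
    return [
        (rng.uniform(*lon_range), rng.uniform(*lat_range))
        for _ in range(n_cells)
    ]

def voronoi_cell(point, seeds):
    """Return the index of the seed nearest to point.

    The Voronoi cell of a seed is the set of points closer to it than to
    any other seed. Planar distance is an assumption made for brevity.
    """
    return min(range(len(seeds)), key=lambda i: math.dist(point, seeds[i]))

rng = random.Random(0)

# Hypothetical (longitude, latitude) origins of three dust samples.
locations = [(-78.6, 35.8), (-105.3, 40.0), (-122.4, 37.8)]

# One random partition per ensemble member; each yields its own label set.
partitions = [
    random_voronoi_seeds(5, (-125.0, -66.0), (24.0, 50.0), rng)
    for _ in range(3)
]
labels = [[voronoi_cell(p, seeds) for p in locations] for seeds in partitions]
```

Because every partition cuts the map differently, each ensemble member learns a different coarse view of geography. How the per-partition predictions are combined into a point estimate is described in the paper and is not reproduced here.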